{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 用Tensorflow进行中文自然语言处理--情感分析"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$f('真好喝')=1$$\n",
    "$$f('太难喝了')=0$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**简介**  \n",
    "大家好，我是Espresso，这是我制作的第一个教程，是一个简单的中文自然语言处理的分类实践。  \n",
    "制作此教程的目的是什么呢？虽然现在自然语言处理的学习资料很多，英文的资料更多，但是网上资源很乱，尤其是中文的系统的学习资料稀少，而且知识点非常分散，缺少比较系统的实践学习资料，就算有一些代码但因为缺少注释导致要花费很长时间才能理解，我个人在学习过程中，在网络搜索花费了一整天时间，才把处理中文的步骤和需要的软件梳理出来。  \n",
    "所以我觉得自己有义务制作一个入门教程把零散的资料结合成一个实践案例方便各位同学学习，在这个教程中我注重的是实践部分，理论部分我推荐学习deeplearning.ai的课程，在下面的代码部分，涉及到哪方面知识的，我推荐一些学习资料并附上链接，如有侵权请e-mail：a66777@188.com。  \n",
    "另外我对自然语言处理并没有任何深入研究，欢迎各位大牛吐槽，希望能指出不足和改善方法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**需要的库**  \n",
    "numpy  \n",
    "jieba  \n",
    "gensim  \n",
    "tensorflow  \n",
    "matplotlib  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "c:\\users\\jinan\\appdata\\local\\programs\\python\\python36\\lib\\site-packages\\gensim\\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial\n",
      "  warnings.warn(\"detected Windows; aliasing chunkize to chunkize_serial\")\n"
     ]
    }
   ],
   "source": [
    "# 首先加载必用的库\n",
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import re\n",
    "import jieba # 结巴分词\n",
    "# gensim用来加载预训练word vector\n",
    "from gensim.models import KeyedVectors\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**预训练词向量**  \n",
    "本教程使用了北京师范大学中文信息处理研究所与中国人民大学 DBIIR 实验室的研究者开源的\"chinese-word-vectors\" github链接为：  \n",
    "https://github.com/Embedding/Chinese-Word-Vectors  \n",
    "如果你不知道word2vec是什么，我推荐以下一篇文章：  \n",
    "https://zhuanlan.zhihu.com/p/26306795  \n",
    "这里我们使用了\"chinese-word-vectors\"知乎Word + Ngram的词向量，可以从上面github链接下载，我们先加载预训练模型并进行一些简单测试："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 使用gensim加载预训练中文分词embedding\n",
    "cn_model = KeyedVectors.load_word2vec_format('chinese_word_vectors/sgns.zhihu.bigram', \n",
    "                                          binary=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**词向量模型**  \n",
    "在这个词向量模型里，每一个词是一个索引，对应的是一个长度为300的向量，我们今天需要构建的LSTM神经网络模型并不能直接处理汉字文本，需要先进行分次并把词汇转换为词向量，步骤请参考下图，步骤的讲解会跟着代码一步一步来，如果你不知道RNN，GRU，LSTM是什么，我推荐deeplearning.ai的课程，网易公开课有免费中文字幕版，但我还是推荐有习题和练习代码部分的的coursera原版：  \n",
    "<img src='flowchart.jpg' style='width:400px;'>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "词向量的长度为300\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([-2.603470e-01,  3.677500e-01, -2.379650e-01,  5.301700e-02,\n",
       "       -3.628220e-01, -3.212010e-01, -1.903330e-01,  1.587220e-01,\n",
       "       -7.156200e-02, -4.625400e-02, -1.137860e-01,  3.515600e-01,\n",
       "       -6.408200e-02, -2.184840e-01,  3.286950e-01, -7.110330e-01,\n",
       "        1.620320e-01,  1.627490e-01,  5.528180e-01,  1.016860e-01,\n",
       "        1.060080e-01,  7.820700e-01, -7.537310e-01, -2.108400e-02,\n",
       "       -4.758250e-01, -1.130420e-01, -2.053000e-01,  6.624390e-01,\n",
       "        2.435850e-01,  9.171890e-01, -2.090610e-01, -5.290000e-02,\n",
       "       -7.969340e-01,  2.394940e-01, -9.028100e-02,  1.537360e-01,\n",
       "       -4.003980e-01, -2.456100e-02, -1.717860e-01,  2.037790e-01,\n",
       "       -4.344710e-01, -3.850430e-01, -9.366000e-02,  3.775310e-01,\n",
       "        2.659690e-01,  8.879800e-02,  2.493440e-01,  4.914900e-02,\n",
       "        5.996000e-03,  3.586430e-01, -1.044960e-01, -5.838460e-01,\n",
       "        3.093280e-01, -2.828090e-01, -8.563400e-02, -5.745400e-02,\n",
       "       -2.075230e-01,  2.845980e-01,  1.414760e-01,  1.678570e-01,\n",
       "        1.957560e-01,  7.782140e-01, -2.359000e-01, -6.833100e-02,\n",
       "        2.560170e-01, -6.906900e-02, -1.219620e-01,  2.683020e-01,\n",
       "        1.678810e-01,  2.068910e-01,  1.987520e-01,  6.720900e-02,\n",
       "       -3.975290e-01, -7.123140e-01,  5.613200e-02,  2.586000e-03,\n",
       "        5.616910e-01,  1.157000e-03, -4.341190e-01,  1.977480e-01,\n",
       "        2.519540e-01,  8.835000e-03, -3.554600e-01, -1.573500e-02,\n",
       "       -2.526010e-01,  9.355900e-02, -3.962500e-02, -1.628350e-01,\n",
       "        2.980950e-01,  1.647900e-01, -5.454270e-01,  3.888790e-01,\n",
       "        1.446840e-01, -7.239600e-02, -7.597800e-02, -7.803000e-03,\n",
       "        2.020520e-01, -4.424750e-01,  3.911580e-01,  2.115100e-01,\n",
       "        6.516760e-01,  5.668030e-01,  5.065500e-02, -1.259650e-01,\n",
       "       -3.720640e-01,  2.330470e-01,  6.659900e-02,  8.300600e-02,\n",
       "        2.540460e-01, -5.279760e-01, -3.843280e-01,  3.366460e-01,\n",
       "        2.336500e-01,  3.564750e-01, -4.884160e-01, -1.183910e-01,\n",
       "        1.365910e-01,  2.293420e-01, -6.151930e-01,  5.212050e-01,\n",
       "        3.412000e-01,  5.757940e-01,  2.354480e-01, -3.641530e-01,\n",
       "        7.373400e-02,  1.007380e-01, -3.211410e-01, -3.040480e-01,\n",
       "       -3.738440e-01, -2.515150e-01,  2.633890e-01,  3.995490e-01,\n",
       "        4.461880e-01,  1.641110e-01,  1.449590e-01, -4.191540e-01,\n",
       "        2.297840e-01,  6.710600e-02,  3.316430e-01, -6.026500e-02,\n",
       "       -5.130610e-01,  1.472570e-01,  2.414060e-01,  2.011000e-03,\n",
       "       -3.823410e-01, -1.356010e-01,  3.112300e-01,  9.177830e-01,\n",
       "       -4.511630e-01,  1.272190e-01, -9.431600e-02, -8.216000e-03,\n",
       "       -3.835440e-01,  2.589400e-02,  6.374980e-01,  4.931630e-01,\n",
       "       -1.865070e-01,  4.076900e-01, -1.841000e-03,  2.213160e-01,\n",
       "        2.253950e-01, -2.159220e-01, -7.611480e-01, -2.305920e-01,\n",
       "        1.296890e-01, -1.304100e-01, -4.742270e-01,  2.275500e-02,\n",
       "        4.255050e-01,  1.570280e-01,  2.975300e-02,  1.931830e-01,\n",
       "        1.304340e-01, -3.179800e-02,  1.516650e-01, -2.154310e-01,\n",
       "       -4.681410e-01,  1.007326e+00, -6.698940e-01, -1.555240e-01,\n",
       "        1.797170e-01,  2.848660e-01,  6.216130e-01,  1.549510e-01,\n",
       "        6.225000e-02, -2.227800e-02,  2.561270e-01, -1.006380e-01,\n",
       "        2.807900e-02,  4.597710e-01, -4.077750e-01, -1.777390e-01,\n",
       "        1.920500e-02, -4.829300e-02,  4.714700e-02, -3.715200e-01,\n",
       "       -2.995930e-01, -3.719710e-01,  4.622800e-02, -1.436460e-01,\n",
       "        2.532540e-01, -9.334000e-02, -4.957400e-02, -3.803850e-01,\n",
       "        5.970110e-01,  3.578450e-01, -6.826000e-02,  4.735200e-02,\n",
       "       -3.707590e-01, -8.621300e-02, -2.556480e-01, -5.950440e-01,\n",
       "       -4.757790e-01,  1.079320e-01,  9.858300e-02,  8.540300e-01,\n",
       "        3.518370e-01, -1.306360e-01, -1.541590e-01,  1.166775e+00,\n",
       "        2.048860e-01,  5.952340e-01,  1.158830e-01,  6.774400e-02,\n",
       "        6.793920e-01, -3.610700e-01,  1.697870e-01,  4.118530e-01,\n",
       "        4.731000e-03, -7.516530e-01, -9.833700e-02, -2.312220e-01,\n",
       "       -7.043300e-02,  1.576110e-01, -4.780500e-02, -7.344390e-01,\n",
       "       -2.834330e-01,  4.582690e-01,  3.957010e-01, -8.484300e-02,\n",
       "       -3.472550e-01,  1.291660e-01,  3.838960e-01, -3.287600e-02,\n",
       "       -2.802220e-01,  5.257030e-01, -3.609300e-02, -4.842220e-01,\n",
       "        3.690700e-02,  3.429560e-01,  2.902490e-01, -1.624650e-01,\n",
       "       -7.513700e-02,  2.669300e-01,  5.778230e-01, -3.074020e-01,\n",
       "       -2.183790e-01, -2.834050e-01,  1.350870e-01,  1.490070e-01,\n",
       "        1.438400e-02, -2.509040e-01, -3.376100e-01,  1.291880e-01,\n",
       "       -3.808700e-01, -4.420520e-01, -2.512300e-01, -1.328990e-01,\n",
       "       -1.211970e-01,  2.532660e-01,  2.757050e-01, -3.382040e-01,\n",
       "        1.178070e-01,  3.860190e-01,  5.277960e-01,  4.581920e-01,\n",
       "        1.502310e-01,  1.226320e-01,  2.768540e-01, -4.502080e-01,\n",
       "       -1.992670e-01,  1.689100e-02,  1.188860e-01,  3.502440e-01,\n",
       "       -4.064770e-01,  2.610280e-01, -1.934990e-01, -1.625660e-01,\n",
       "        2.498400e-02, -1.867150e-01, -1.954400e-02, -2.281900e-01,\n",
       "       -3.417670e-01, -5.222770e-01, -9.543200e-02, -3.500350e-01,\n",
       "        2.154600e-02,  2.318040e-01,  5.395310e-01, -4.223720e-01],\n",
       "      dtype=float32)"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 由此可见每一个词都对应一个长度为300的向量\n",
    "embedding_dim = cn_model['山东大学'].shape[0]\n",
    "print('词向量的长度为{}'.format(embedding_dim))\n",
    "cn_model['山东大学']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cosine Similarity for Vector Space Models by Christian S. Perone\n",
    "http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.66128117"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 计算相似度\n",
    "cn_model.similarity('橘子', '橙子')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.66128117"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# dot（'橘子'/|'橘子'|， '橙子'/|'橙子'| ）\n",
    "np.dot(cn_model['橘子']/np.linalg.norm(cn_model['橘子']), \n",
    "cn_model['橙子']/np.linalg.norm(cn_model['橙子']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('高中', 0.7247823476791382),\n",
       " ('本科', 0.6768535375595093),\n",
       " ('研究生', 0.6244412660598755),\n",
       " ('中学', 0.6088204979896545),\n",
       " ('大学本科', 0.595908522605896),\n",
       " ('初中', 0.5883588790893555),\n",
       " ('读研', 0.5778335332870483),\n",
       " ('职高', 0.5767995119094849),\n",
       " ('大学毕业', 0.5767451524734497),\n",
       " ('师范大学', 0.5708829760551453)]"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 找出最相近的词，余弦相似度\n",
    "cn_model.most_similar(positive=['大学'], topn=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "在 老师 会计师 程序员 律师 医生 老人 中:\n",
      "不是同一类别的词为: 老人\n"
     ]
    }
   ],
   "source": [
    "# 找出不同的词\n",
    "test_words = '老师 会计师 程序员 律师 医生 老人'\n",
    "test_words_result = cn_model.doesnt_match(test_words.split())\n",
    "print('在 '+test_words+' 中:\\n不是同一类别的词为: %s' %test_words_result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cn_model.most_similar(positive=['女人','出轨'], negative=['男人'], topn=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**训练语料**  \n",
    "本教程使用了谭松波老师的酒店评论语料，即使是这个语料也很难找到下载链接，在某博客还得花积分下载，而我不知道怎么赚取积分，后来好不容易找到一个链接但竟然是失效的，再后来尝试把链接粘贴到迅雷上终于下载了下来，希望大家以后多多分享资源。  \n",
    "训练样本分别被放置在两个文件夹里：\n",
    "分别的pos和neg，每个文件夹里有2000个txt文件，每个文件内有一段评语，共有4000个训练样本，这样大小的样本数据在NLP中属于非常迷你的："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 获得样本的索引，样本存放于两个文件夹中，\n",
    "# 分别为 正面评价'pos'文件夹 和 负面评价'neg'文件夹\n",
    "# 每个文件夹中有2000个txt文件，每个文件中是一例评价\n",
    "import os\n",
    "pos_txts = os.listdir('pos')\n",
    "neg_txts = os.listdir('neg')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "样本总共: 4000\n"
     ]
    }
   ],
   "source": [
    "print( '样本总共: '+ str(len(pos_txts) + len(neg_txts)) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 现在我们将所有的评价内容放置到一个list里\n",
    "\n",
    "train_texts_orig = [] # 存储所有评价，每例评价为一条string\n",
    "\n",
    "# 添加完所有样本之后，train_texts_orig为一个含有4000条文本的list\n",
    "# 其中前2000条文本为正面评价，后2000条为负面评价\n",
    "\n",
    "for i in range(len(pos_txts)):\n",
    "    with open('pos/'+pos_txts[i], 'r', errors='ignore') as f:\n",
    "        text = f.read().strip()\n",
    "        train_texts_orig.append(text)\n",
    "        f.close()\n",
    "for i in range(len(neg_txts)):\n",
    "    with open('neg/'+neg_txts[i], 'r', errors='ignore') as f:\n",
    "        text = f.read().strip()\n",
    "        train_texts_orig.append(text)\n",
    "        f.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4000"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(train_texts_orig)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 我们使用tensorflow的keras接口来建模\n",
    "from tensorflow.python.keras.models import Sequential\n",
    "from tensorflow.python.keras.layers import Dense, GRU, Embedding, LSTM, Bidirectional\n",
    "from tensorflow.python.keras.preprocessing.text import Tokenizer\n",
    "from tensorflow.python.keras.preprocessing.sequence import pad_sequences\n",
    "from tensorflow.python.keras.optimizers import RMSprop\n",
    "from tensorflow.python.keras.optimizers import Adam\n",
    "from tensorflow.python.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**分词和tokenize**  \n",
    "首先我们去掉每个样本的标点符号，然后用jieba分词，jieba分词返回一个生成器，没法直接进行tokenize，所以我们将分词结果转换成一个list，并将它索引化，这样每一例评价的文本变成一段索引数字，对应着预训练词向量模型中的词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Building prefix dict from the default dictionary ...\n",
      "Loading model from cache C:\\Users\\jinan\\AppData\\Local\\Temp\\jieba.cache\n",
      "Loading model cost 0.672 seconds.\n",
      "Prefix dict has been built succesfully.\n"
     ]
    }
   ],
   "source": [
    "# 进行分词和tokenize\n",
    "# train_tokens是一个长长的list，其中含有4000个小list，对应每一条评价\n",
    "train_tokens = []\n",
    "for text in train_texts_orig:\n",
    "    # 去掉标点\n",
    "    text = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——！，。？、~@#￥%……&*（）]+\", \"\",text)\n",
    "    # 结巴分词\n",
    "    cut = jieba.cut(text)\n",
    "    # 结巴分词的输出结果为一个生成器\n",
    "    # 把生成器转换为list\n",
    "    cut_list = [ i for i in cut ]\n",
    "    for i, word in enumerate(cut_list):\n",
    "        try:\n",
    "            # 将词转换为索引index\n",
    "            cut_list[i] = cn_model.vocab[word].index\n",
    "        except KeyError:\n",
    "            # 如果词不在字典中，则输出0\n",
    "            cut_list[i] = 0\n",
    "    train_tokens.append(cut_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**索引长度标准化**  \n",
    "因为每段评语的长度是不一样的，我们如果单纯取最长的一个评语，并把其他评填充成同样的长度，这样十分浪费计算资源，所以我们取一个折衷的长度。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 获得所有tokens的长度\n",
    "num_tokens = [ len(tokens) for tokens in train_tokens ]\n",
    "num_tokens = np.array(num_tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "71.4495"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 平均tokens的长度\n",
    "np.mean(num_tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1540"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 最长的评价tokens的长度\n",
    "np.max(num_tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYsAAAEWCAYAAACXGLsWAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAHXBJREFUeJzt3XmYXVWd7vHva0DGCGIKhAQo0CgiDagRaUFF4dogKNxWBgcMg532qqCAV4PYgl69QmOjOBuZIiKCiIKiNohw0QcFwxgEsXkYQgiSIFMYGgi+94+9ipwUVbVPDafOqTrv53nOU2evPaxfdlXO76y19l5btomIiBjK89odQEREdL4ki4iIqJVkERERtZIsIiKiVpJFRETUSrKIiIhaSRbRFEnflvRvY3SszSQ9KmlKWb5c0gfG4tjleL+UNHusjjeMej8v6X5Jfx2DY+0iafFYxDWKGCzppW2ot+3/9niuJItA0p2SnpC0XNJDkq6U9EFJz/592P6g7f/T5LF2G2ob24tsr2v7mTGI/ThJ3+93/D1szx/tsYcZx6bAUcDWtl88wPp8AA6iXUkphifJIvq83fZUYHPgeOCTwKljXYmk1cb6mB1ic+Bvtpe2O5CIVkiyiFXYftj2hcD+wGxJ2wBIOkPS58v7aZJ+XlohD0j6raTnSToT2Az4Welm+oSk3vLN8VBJi4DfNJQ1Jo6XSLpa0sOSLpC0QanrOd/I+1ovknYHPgXsX+q7oax/tlurxPVpSXdJWirpe5LWK+v64pgtaVHpQjpmsHMjab2y/7JyvE+X4+8GXAJsUuI4o99+6wC/bFj/qKRNJK0h6SuSlpTXVyStMUjdh0u6WdKMsryXpOsbWoLb9js/H5d0Yzmf50hac6jf3RB/En3HXEPSl8p5uq90S67V+DuSdFQ5x/dKOrhh3xdJ+pmkRyT9sXTX/a6su6JsdkM5L/s37Dfg8aI9kixiQLavBhYDbxhg9VFlXQ+wEdUHtm0fCCyiaqWsa/vfG/Z5E/AK4J8GqfL9wCHAJsAK4KtNxPgr4P8C55T6thtgs4PK683AlsC6wNf7bbMz8HJgV+Azkl4xSJVfA9Yrx3lTiflg278G9gCWlDgO6hfnY/3Wr2t7CXAMsCOwPbAdsAPw6f6VqhorOgh4k+3Fkl4NnAb8K/Ai4DvAhf0SzX7A7sAWwLZlfxjkdzfIv7fRCcDLSqwvBaYDn2lY/+JybqYDhwLfkPTCsu4bwGNlm9nl1Xdu3ljeblfOyzlNHC/aIMkihrIE2GCA8qeBjYHNbT9t+7eun2TsONuP2X5ikPVn2r6pfLD+G7CfygD4KL0XOMn27bYfBY4GDujXqvms7Sds3wDcQPXBvYoSy/7A0baX274T+A/gwFHG9jnbS20vAz7b73iSdBJVgn1z2QbgX4Dv2L7K9jNlfOZJqsTT56u2l9h+APgZ1Yc8jOB3J0mlziNsP2B7OVWSPqBhs6fLv+Vp278AHgVeXs7bO4FjbT9u+2agmfGkAY/XxH7RIkkWMZTpwAMDlJ8I3AZcLOl2SXObONbdw1h/F7A6MK2pKIe2STle47FXo/pW3afx6qXHqVof/U0Dnj/AsaaPcWybNCyvD8wBvmj74YbyzYGjSlfSQ5IeAjbtt+9g/6aR/O56gLWBaxrq+1Up7/M32ysGqLOH6nw3/n7r/haGOl60SZJFDEjSa6k+CH/Xf135Zn2U7S2BtwNHStq1b/Ugh6xreWza8H4zqm+W91N1X6zdENcUVv2QqjvuEqoP18ZjrwDuq9mvv/tLTP2PdU+T+w8U50CxLWlYfhDYCzhd0k4N5XcDX7C9fsNrbdtn1wYx9O9uMPcDTwCvbKhvPdvNfHgvozrfMxrKNh1k2+hgSRaxCkkvkLQX8EPg+7YXDrDNXpJeWronHgGeKS+oPoS3HEHV75O0taS1gc8B55VLa/8CrClpT0mrU/XpN/bN3wf0DjFIezZwhKQtJK3LyjGOFYNsP6ASy7nAFyRNlbQ5cCTw/aH3XCXOF/UNrjfE9mlJPZKmUY0B9L8M+HKq7qqfSHpdKf4u8EFJr1NlnXJ+ptYFUfO7G5Dtv5c6vyxpw3Kc6ZIGG39q3PcZ4HzgOElrS9qKaqyn0Uj/ZmIcJVlEn59JWk71rfUY4CRgsCtQZgK/pupH/j3wzfKhBvBFqg/AhyR9fBj1nwmcQdV9siZwOFRXZwEfAk6h+hb/GNUAbZ8flZ9/k3TtAMc9rRz7CuAO4L+Bw4YRV6PDSv23U7W4flCOX8v2n6mSw+3l3GwCfB5YANwILASuLWX9972E6ndxoaTX2F5ANYbwdarWx22sHMCuM9TvbiifLPX8QdIj5RjNjiF8hGqw+q9Uv4uzqcZY+hwHzC/nZb8mjxnjTHn4UUSMJ0knAC+2Pe532cfIpWURES0laStJ25Yusx2oLoX9SbvjiuGZrHfTRkTnmErV9bQJsJTqkuML2hpRDFu6oSIiola6oSIiotaE7oaaNm2ae3t72x1GRMSEcs0119xvu6d+y5UmdLLo7e1lwYIF7Q4jImJCkXRX/VarSjdURETUSrKIiIhaSRYREVErySIiImolWURERK0ki4iIqJVkERERtZIsIiKiVpJFRETUSrKIUemdexG9cy9qdxgR0WJJFhERUSvJIiIiak3oiQRj8urr2rrz+D2Hvc9w94uIemlZRERErSSLiIiolWQRERG1kiwiIqJWkkVERNRKsoiIiFpJFhERUSvJIiIiaiVZxISReagi2ifJIiIiarUsWUg6TdJSSTc1lJ0o6c+SbpT0E0nrN6w7WtJtkm6V9E+tiisiIoavlS2LM4Dd+5VdAmxje1vgL8DRAJK2Bg4AXln2+aakKS2MLSIihqFlycL2FcAD/coutr2iLP4BmFHe7w380PaTtu8AbgN2aFVsERExPO0cszgE+GV5Px24u2Hd4lL2HJLmSFogacGyZctaHGJERECbkoWkY4AVwFl9RQNs5oH2tT3P9izbs3p6eloVYkRENBj351lImg3sBexquy8hLAY2bdhsBrBkvGOLyWMkz8OIiMGNa8tC0u7AJ4F32H68YdWFwAGS1pC0BTATuHo8Y4uIiMG1rGUh6WxgF2CapMXAsVRXP60BXCIJ4A+2P2j7T5LOBW6m6p76sO1nWhVbREQMT8uShe13D1B86hDbfwH4Qqviic6RLqKIiSd3cEdERK0ki4iIqJVkERERtZIsIiKi1rjfZxExlExBHtGZkiyiqzQmo1yNFdG8dENFREStJIuIiKiVbqiY8DLOEdF6aVlEREStJIuIiKiVbqjoCMPpSsrcUhHjLy2LiIiolWQRERG1kiwiIqJWkkVERNRKsoiIiFpJFhERUSvJIiIiaiVZRERErSSLiIiolTu4oys0c4d47gyPGFxaFhERUatlLQtJpwF7AUttb1PKNgDOAXqBO4H9bD8oScDJwNuAx4GDbF/bqtiiPfp/u8/U4hETRytbFmcAu/crmwtcansmcGlZBtgDmFlec4BvtTCuiFX0zr0oiSuiRsuShe0rgAf6Fe8NzC/v5wP7NJR/z5U/AOtL2rhVsUVExPCM95jFRrbvBSg/Nyzl04G7G7ZbXMqeQ9IcSQskLVi2bFlLg42IiEqnDHBrgDIPtKHtebZn2Z7V09PT4rAiIgLGP1nc19e9VH4uLeWLgU0btpsBLBnn2CIiYhDjnSwuBGaX97OBCxrK36/KjsDDfd1VERHRfq28dPZsYBdgmqTFwLHA8cC5kg4FFgH7ls1/QXXZ7G1Ul84e3Kq4IiJi+FqWLGy/e5BVuw6wrYEPtyqWiIgYnUz3ERNW7o2IGD+1YxaSdpK0Tnn/PkknSdq89aFFRESnaGaA+1vA45K2Az4B3AV8r6VRxaSQO6MjJo9muqFW2LakvYGTbZ8qaXbtXjHhdcIsrKNNNklWEWOjmWSxXNLRwPuAN0qaAqze2rAiIqKTNNMNtT/wJHCo7b9STcNxYkujioiIjlLbsigJ4qSG5UVkzCIioqs0czXUP0v6L0kPS3pE0nJJj4xHcBER0RmaGbP4d+Dttm9pdTAREdGZmhmzuC+JIiKiuzXTslgg6Rzgp1QD3QDYPr9lUUVEREdpJlm8gGpyv7c2lBlIsoiI6BLNXA2VGWAjIrpcM1dDvUzSpZJuKsvbSvp060OLiIhO0cwA93eBo4GnAWzfCBzQyqAiIqKzNJMs1rZ9db+yFa0IJiIiOlMzyeJ+SS+hGtRG0ruAPPI0ulJm0o1u1czVUB8G5gFbSboHuINqUsGISakTZtuN6DTNJIt7bO9WHoD0PNvLJW3Q6sAiIqJzNNMNdb6k1Ww/VhLFi4FLWh1YdI50vUREM8nip8B5kqZI6gUupro6KiIiukQzN+V9V9LzqZJGL/Cvtq9sdWAREdE5Bk0Wko5sXAQ2Ba4HdpS0o+2TBt6znqQjgA9QXWG1EDgY2Bj4IbABcC1woO2nRlpHRESMnaG6oaY2vNYFfgLc1lA2IpKmA4cDs2xvA0yhusnvBODLtmcCDwKHjrSOiIgYW4O2LGx/tnFZ0tSq2I+OUb1rSXoaWJvqvo23AO8p6+cDxwHfGoO6IiJilGrHLCRtA5xJ1T2EpPuB99v+00gqtH2PpC8Bi4AnqAbMrwEest13Z/hiqmd9DxTPHGAOwGabbTaSEKJGrnyKiP6auRpqHnCk7c1tbw4cRTVf1IhIeiGwN7AFsAmwDrDHAJt6oP1tz7M9y/asnp6ekYYRERHD0MxNeevYvqxvwfbl5Qa9kdoNuMP2MgBJ5wOvB9Yv93OsAGYAS0ZRR0SttKAimtdMsrhd0r9RdUVBNdXHHaOocxHVFVVrU3VD7QosAC4D3kV1RdRs4IJR1BEt0PjhmqkwIrpLM91QhwA9VE/GOx+YBhw00gptXwWcR3V57MISwzzgk8CRkm4DXgScOtI6IiJibDXTstjN9uGNBZL2BX400kptHwsc26/4dmCHkR4zIiJap5mWxUBTe2S6j4iILjLUHdx7AG8Dpkv6asOqF5CHH0WXyWB4dLuhuqGWUA08v4PqPog+y4EjWhlURCdIgohYaag7uG8AbpD0A9tPj2NMERHRYWrHLJIoIiKimQHuiIjocoMmC0lnlp8fHb9wIiKiEw3VsniNpM2BQyS9UNIGja/xCjAiItpvqKuhvg38CtiS6mooNaxzKY+IiC4waMvC9ldtvwI4zfaWtrdoeCVRRER0kWaewf2/JG0HvKEUXWH7xtaGFZ0u9yBEdJfaq6EkHQ6cBWxYXmdJOqzVgUVEROdoZiLBDwCvs/0YgKQTgN8DX2tlYBER0Tmauc9CwDMNy8+w6mB3RERMcs20LE4HrpL0k7K8D3nWREREV2lmgPskSZcDO1O1KA62fV2rA4vx0TdQnSffRcRQmmlZYPtaqifbRUREF8rcUBERUSvJIiIiag2ZLCRNkfTr8QomIiI605DJwvYzwOOS1huneCIiogM1M8D938BCSZcAj/UV2j68ZVFFTBCN057kirKYzJpJFheVV0REdKlm7rOYL2ktYDPbt45FpZLWB04BtqGa7vwQ4FbgHKAXuBPYz/aDY1FfRESMTjMTCb4duJ7q2RZI2l7ShaOs92TgV7a3ArYDbgHmApfanglcWpajRXrnXpSZYyOiac1cOnscsAPwEIDt64EtRlqhpBcAb6RMGWL7KdsPAXsD88tm86mmFYmIiA7QTLJYYfvhfmUeRZ1bAsuA0yVdJ+kUSesAG9m+F6D83HCgnSXNkbRA0oJly5aNIoyIiGhWM8niJknvAaZIminpa8CVo6hzNeDVwLdsv4rqCqumu5xsz7M9y/asnp6eUYQRERHNaiZZHAa8EngSOBt4BPjYKOpcDCy2fVVZPo8qedwnaWOA8nPpKOqIaKmM+US3aeZqqMeBY8pDj2x7+WgqtP1XSXdLenm5umpX4Obymg0cX35eMJp6onN0y4dqZvCNyaw2WUh6LXAaMLUsPwwcYvuaUdR7GNXjWZ8P3A4cTNXKOVfSocAiYN9RHD8iIsZQMzflnQp8yPZvASTtTPVApG1HWmm5omrWAKt2HekxIyKidZpJFsv7EgWA7d9JGlVXVExe3dLlFNFtBk0Wkl5d3l4t6TtUg9sG9gcub31oERHRKYZqWfxHv+VjG96P5j6LmITSooiY3AZNFrbfPJ6BRERE52rmaqj1gfdTTfD37PaZojwions0M8D9C+APwELg760NJyIiOlEzyWJN20e2PJKIiOhYzUz3caakf5G0saQN+l4tjywiIjpGMy2Lp4ATgWNYeRWUqWaPjYiILtBMsjgSeKnt+1sdTEREdKZmuqH+BDze6kAiJovMSBuTUTMti2eA6yVdRjVNOZBLZyMiukkzyeKn5RUREV2qmedZzK/bJiIiJrdm7uC+gwHmgrKdq6EiIrpEM91Qjc+dWJPqoUS5zyIioos00w31t35FX5H0O+AzrQkpYnLof0VUHrcaE1kz3VCvblh8HlVLY2rLIoqIiI7TTDdU43MtVgB3Avu1JJqIiOhIzXRD5bkWERFdrpluqDWAd/Lc51l8rnVhRUREJ2mmG+oC4GHgGhru4I6I4Wkc8M5gd0w0zSSLGbZ3H+uKJU0BFgD32N5L0hbAD6kuy70WOND2U2Ndb0REDF8zEwleKekfWlD3R4FbGpZPAL5seybwIHBoC+qMiIgRaCZZ7AxcI+lWSTdKWijpxtFUKmkGsCdwSlkW8BbgvLLJfGCf0dQRERFjp5luqD1aUO9XgE+w8n6NFwEP2V5RlhcD01tQb0REjEAzl87eNZYVStoLWGr7Gkm79BUPVPUg+88B5gBsttlmYxlaREQMopluqLG2E/AOSXdSDWi/haqlsb6kvuQ1A1gy0M6259meZXtWT0/PeMQbEdH1xj1Z2D7a9gzbvcABwG9svxe4DHhX2Ww21SW7ERHRAdrRshjMJ4EjJd1GNYZxapvjiYiIopkB7paxfTlweXl/O7BDO+OJiIiBdVLLIiIiOlRbWxYx/vo/YyEiohlpWURERK0ki4iIqJVkERERtZIsIiKiVpJFRETUSrKIiIhaSRYREVErySIiImolWURERK0ki4iIqJVkEdFGvXMvyhQsMSEkWURERK1MJBjRBmlNxESTlkVEh0tXVXSCJIuIiKiVbqiIDtLYgrjz+D3bGEnEqtKyiIiIWkkWERFRK8kiIiJqJVlEREStJIuIiKg17ldDSdoU+B7wYuDvwDzbJ0vaADgH6AXuBPaz/eB4xxfRDrmPIjpdO1oWK4CjbL8C2BH4sKStgbnApbZnApeW5YiI6ADjnixs32v72vJ+OXALMB3YG5hfNpsP7DPesUVExMDaOmYhqRd4FXAVsJHte6FKKMCGg+wzR9ICSQuWLVs2XqFGtF2m/Yh2aluykLQu8GPgY7YfaXY/2/Nsz7I9q6enp3UBRkTEs9qSLCStTpUozrJ9fim+T9LGZf3GwNJ2xBYREc/VjquhBJwK3GL7pIZVFwKzgePLzwvGO7aITpIup+gk7ZhIcCfgQGChpOtL2aeoksS5kg4FFgH7tiG2iIgYwLgnC9u/AzTI6l3HM5aIiSgz00Y75A7uiIiolWQRMYHlctoYL3n40SSW7oqIGCtpWURERK0ki4hJIN1R0WpJFhERUSvJIiIiaiVZREwi6Y6KVkmyiIiIWkkWERFRK8kiIiJqJVlEREStJIuIiKiV6T4iJrFM+RJjJS2LiIiolWQRERG10g0VMQmN9Ma8vv3SZRX9pWURERG1kiwmkUz1EEPp//eRv5cYjnRDTRDpHoixkgQRI5GWRURE1EqyiIhBpasq+iRZRERErY4bs5C0O3AyMAU4xfbxbQ4pYlIbqOXQvyxjZtFRyULSFOAbwP8AFgN/lHSh7ZvHO5Z2TJMwnDozjUNMBMNJMklIna3TuqF2AG6zfbvtp4AfAnu3OaaIiK4n2+2O4VmS3gXsbvsDZflA4HW2P9KwzRxgTlncBrhp3APtTNOA+9sdRIfIuVgp52KlnIuVXm576nB26KhuKEADlK2SzWzPA+YBSFpge9Z4BNbpci5WyrlYKedipZyLlSQtGO4+ndYNtRjYtGF5BrCkTbFERETRacnij8BMSVtIej5wAHBhm2OKiOh6HdUNZXuFpI8A/0l16exptv80xC7zxieyCSHnYqWci5VyLlbKuVhp2Oeiowa4IyKiM3VaN1RERHSgJIuIiKg1YZOFpN0l3SrpNklz2x1Pu0jaVNJlkm6R9CdJH213TO0kaYqk6yT9vN2xtJuk9SWdJ+nP5e/jH9sdU7tIOqL8/7hJ0tmS1mx3TONF0mmSlkq6qaFsA0mXSPqv8vOFdceZkMmiYVqQPYCtgXdL2rq9UbXNCuAo268AdgQ+3MXnAuCjwC3tDqJDnAz8yvZWwHZ06XmRNB04HJhlexuqi2cOaG9U4+oMYPd+ZXOBS23PBC4ty0OakMmCTAvyLNv32r62vF9O9YEwvb1RtYekGcCewCntjqXdJL0AeCNwKoDtp2w/1N6o2mo1YC1JqwFr00X3b9m+AnigX/HewPzyfj6wT91xJmqymA7c3bC8mC79gGwkqRd4FXBVeyNpm68AnwD+3u5AOsCWwDLg9NItd4qkddodVDvYvgf4ErAIuBd42PbF7Y2q7TayfS9UXziBDet2mKjJonZakG4jaV3gx8DHbD/S7njGm6S9gKW2r2l3LB1iNeDVwLdsvwp4jCa6Giaj0h+/N7AFsAmwjqT3tTeqiWeiJotMC9JA0upUieIs2+e3O5422Ql4h6Q7qbol3yLp++0Nqa0WA4tt97Uyz6NKHt1oN+AO28tsPw2cD7y+zTG1232SNgYoP5fW7TBRk0WmBSkkiapf+hbbJ7U7nnaxfbTtGbZ7qf4efmO7a7892v4rcLekl5eiXYFxfy5Mh1gE7Chp7fL/ZVe6dLC/wYXA7PJ+NnBB3Q4dNd1Hs0YwLchkthNwILBQ0vWl7FO2f9HGmKIzHAacVb5Q3Q4c3OZ42sL2VZLOA66lunrwOrpo6g9JZwO7ANMkLQaOBY4HzpV0KFUy3bf2OJnuIyIi6kzUbqiIiBhHSRYREVErySIiImolWURERK0ki4iIqJVkEROWpEdbcMztJb2tYfk4SR8fxfH2LTO+XtavvFfSe5rY/yBJXx9p/RFjJckiYlXbA2+r3ap5hwIfsv3mfuW9QG2yiOgUSRYxKUj635L+KOlGSZ8tZb3lW/13y7MMLpa0Vln32rLt7yWdWJ5z8Hzgc8D+kq6XtH85/NaSLpd0u6TDB6n/3ZIWluOcUMo+A+wMfFvSif12OR54Q6nnCElrSjq9HOM6Sf2TC5L2LPFOk9Qj6cfl3/xHSTuVbY4rzy9YJV5J60i6SNINJcb9+x8/Yki288prQr6AR8vPt1LdkSuqL0A/p5qeu5fqjt3ty3bnAu8r728CXl/eHw/cVN4fBHy9oY7jgCuBNYBpwN+A1fvFsQnVXbA9VLMi/AbYp6y7nOo5Cv1j3wX4ecPyUcDp5f1W5Xhr9sUD/E/gt8ALyzY/AHYu7zejmu5l0HiBdwLfbahvvXb//vKaWK8JOd1HRD9vLa/ryvK6wEyqD9w7bPdNg3IN0CtpfWCq7StL+Q+AvYY4/kW2nwSelLQU2Ihqor4+rwUut70MQNJZVMnqp8P4N+wMfA3A9p8l3QW8rKx7MzALeKtXzii8G1WLp2//F0iaOkS8C4EvlVbPz23/dhixRSRZxKQg4Iu2v7NKYfV8jycbip4B1mLgKe6H0v8Y/f/fDPd4AxnqGLdTPZ/iZcCCUvY84B9tP7HKQark8Zx4bf9F0muoxmO+KOli258bg7ijS2TMIiaD/wQOKc/0QNJ0SYM+zMX2g8BySTuWosZHbC4Hpj53ryFdBbypjCVMAd4N/L+affrXcwXw3hL/y6i6lm4t6+4C/hn4nqRXlrKLgY/07Sxp+6Eqk7QJ8Ljt71M9CKhbpyuPEUqyiAnP1VPPfgD8XtJCqmc31H3gHwrMk/R7qm/1D5fyy6i6d65vdhDY1ZPGji773gBca7tuyucbgRVlwPkI4JvAlBL/OcBBpSupr45bqZLJjyS9hPJM6TJIfzPwwZr6/gG4usxMfAzw+Wb+bRF9MutsdCVJ69p+tLyfC2xs+6NtDiuiY2XMIrrVnpKOpvo/cBfVVUcRMYi0LCIiolbGLCIiolaSRURE1EqyiIiIWkkWERFRK8kiIiJq/X+K7ei/t/dDegAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "plt.hist(np.log(num_tokens), bins = 100)\n",
    "plt.xlim((0,10))\n",
    "plt.ylabel('number of tokens')\n",
    "plt.xlabel('length of tokens')\n",
    "plt.title('Distribution of tokens length')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "236"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 取tokens平均值并加上两个tokens的标准差，\n",
    "# 假设tokens长度的分布为正态分布，则max_tokens这个值可以涵盖95%左右的样本\n",
    "max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)\n",
    "max_tokens = int(max_tokens)\n",
    "max_tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9565"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 取tokens的长度为236时，大约95%的样本被涵盖\n",
    "# 我们对长度不足的进行padding，超长的进行修剪\n",
    "np.sum( num_tokens < max_tokens ) / len(num_tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**反向tokenize**  \n",
    "我们定义一个function，用来把索引转换成可阅读的文本，这对于debug很重要。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 用来将tokens转换为文本\n",
    "def reverse_tokens(tokens):\n",
    "    text = ''\n",
    "    for i in tokens:\n",
    "        if i != 0:\n",
    "            text = text + cn_model.index2word[i]\n",
    "        else:\n",
    "            text = text + ' '\n",
    "    return text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "reverse = reverse_tokens(train_tokens[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "以下可见，训练样本的极性并不是那么精准，比如说下面的样本，对早餐并不满意，但被定义为正面评价，这会迷惑我们的模型，不过我们暂时不对训练样本进行任何修改。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'早餐太差无论去多少人那边也不加食品的酒店应该重视一下这个问题了房间本身很好'"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 经过tokenize再恢复成文本\n",
    "# 可见标点符号都没有了\n",
    "reverse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'早餐太差，无论去多少人，那边也不加食品的。酒店应该重视一下这个问题了。\\n\\n房间本身很好。'"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 原始文本\n",
    "train_texts_orig[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**准备Embedding Matrix**  \n",
    "现在我们来为模型准备embedding matrix（词向量矩阵），根据keras的要求，我们需要准备一个维度为$(numwords, embeddingdim)$的矩阵，num words代表我们使用的词汇的数量，emdedding dimension在我们现在使用的预训练词向量模型中是300，每一个词汇都用一个长度为300的向量表示。  \n",
    "注意我们只选择使用前50k个使用频率最高的词，在这个预训练词向量模型中，一共有260万词汇量，如果全部使用在分类问题上会很浪费计算资源，因为我们的训练样本很小，一共只有4k，如果我们有100k，200k甚至更多的训练样本时，在分类问题上可以考虑减少使用的词汇量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "300"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embedding_dim"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 只使用前20000个词\n",
    "num_words = 50000\n",
    "# 初始化embedding_matrix，之后在keras上进行应用\n",
    "embedding_matrix = np.zeros((num_words, embedding_dim))\n",
    "# embedding_matrix为一个 [num_words，embedding_dim] 的矩阵\n",
    "# 维度为 50000 * 300\n",
    "for i in range(num_words):\n",
    "    embedding_matrix[i,:] = cn_model[cn_model.index2word[i]]\n",
    "embedding_matrix = embedding_matrix.astype('float32')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "300"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 检查index是否对应，\n",
    "# 输出300意义为长度为300的embedding向量一一对应\n",
    "np.sum( cn_model[cn_model.index2word[333]] == embedding_matrix[333] )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(50000, 300)"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# embedding_matrix的维度，\n",
    "# 这个维度为keras的要求，后续会在模型中用到\n",
    "embedding_matrix.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**padding（填充）和truncating（修剪）**  \n",
    "我们把文本转换为tokens（索引）之后，每一串索引的长度并不相等，所以为了方便模型的训练我们需要把索引的长度标准化，上面我们选择了236这个可以涵盖95%训练样本的长度，接下来我们进行padding和truncating，我们一般采用'pre'的方法，这会在文本索引的前面填充0，因为根据一些研究资料中的实践，如果在文本索引后面填充0的话，会对模型造成一些不良影响。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 进行padding和truncating， 输入的train_tokens是一个list\n",
    "# 返回的train_pad是一个numpy array\n",
    "train_pad = pad_sequences(train_tokens, maxlen=max_tokens,\n",
    "                            padding='pre', truncating='pre')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 超出五万个词向量的词用0代替\n",
    "train_pad[ train_pad>=num_words ] = 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([    0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "         290,  3053,    57,   169,    73,     1,    25, 11216,    49,\n",
       "         163, 15985,     0,     0,    30,     8,     0,     1,   228,\n",
       "         223,    40,    35,   653,     0,     5,  1642,    29, 11216,\n",
       "        2751,   500,    98,    30,  3159,  2225,  2146,   371,  6285,\n",
       "         169, 27396,     1,  1191,  5432,  1080, 20055,    57,   562,\n",
       "           1, 22671,    40,    35,   169,  2567,     0, 42665,  7761,\n",
       "         110,     0,     0, 41281,     0,   110,     0, 35891,   110,\n",
       "           0, 28781,    57,   169,  1419,     1, 11670,     0, 19470,\n",
       "           1,     0,     0,   169, 35071,    40,   562,    35, 12398,\n",
       "         657,  4857])"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 可见padding之后前面的tokens全变成0，文本在最后面\n",
    "train_pad[33]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 准备target向量，前2000样本为1，后2000为0\n",
    "train_target = np.concatenate( (np.ones(2000),np.zeros(2000)) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 进行训练和测试样本的分割\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 90%的样本用来训练，剩余10%用来测试\n",
    "X_train, X_test, y_train, y_test = train_test_split(train_pad,\n",
    "                                                    train_target,\n",
    "                                                    test_size=0.1,\n",
    "                                                    random_state=12)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                                                                                                                                                                                                                        房间很大还有海景阳台走出酒店就是沙滩非常不错唯一遗憾的就是不能刷 不方便\n",
      "class:  1.0\n"
     ]
    }
   ],
   "source": [
    "# 查看训练样本，确认无误\n",
    "print(reverse_tokens(X_train[35]))\n",
    "print('class: ',y_train[35])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在我们用keras搭建LSTM模型，模型的第一层是Embedding层，只有当我们把tokens索引转换为词向量矩阵之后，才可以用神经网络对文本进行处理。\n",
    "keras提供了Embedding接口，避免了繁琐的稀疏矩阵操作。   \n",
    "在Embedding层我们输入的矩阵为：$$(batchsize, maxtokens)$$\n",
    "输出矩阵为： $$(batchsize, maxtokens, embeddingdim)$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 用LSTM对样本进行分类\n",
    "model = Sequential()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 模型第一层为embedding\n",
    "model.add(Embedding(num_words,\n",
    "                    embedding_dim,\n",
    "                    weights=[embedding_matrix],\n",
    "                    input_length=max_tokens,\n",
    "                    trainable=False))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.add(Bidirectional(LSTM(units=32, return_sequences=True)))\n",
    "model.add(LSTM(units=16, return_sequences=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**构建模型**  \n",
    "我在这个教程中尝试了几种神经网络结构，因为训练样本比较少，所以我们可以尽情尝试，训练过程等待时间并不长：  \n",
    "**GRU：**如果使用GRU的话，测试样本可以达到87%的准确率，但我测试自己的文本内容时发现，GRU最后一层激活函数的输出都在0.5左右，说明模型的判断不是很明确，信心比较低，而且经过测试发现模型对于否定句的判断有时会失误，我们期望对于负面样本输出接近0，正面样本接近1而不是都徘徊于0.5之间。  \n",
    "**BiLSTM：**测试了LSTM和BiLSTM，发现BiLSTM的表现最好，LSTM的表现略好于GRU，这可能是因为BiLSTM对于比较长的句子结构有更好的记忆，有兴趣的朋友可以深入研究一下。  \n",
    "Embedding之后第，一层我们用BiLSTM返回sequences，然后第二层16个单元的LSTM不返回sequences，只返回最终结果，最后是一个全链接层，用sigmoid激活函数输出结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "# GRU的代码\n",
    "# model.add(GRU(units=32, return_sequences=True))\n",
    "# model.add(GRU(units=16, return_sequences=True))\n",
    "# model.add(GRU(units=4, return_sequences=False))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.add(Dense(1, activation='sigmoid'))\n",
    "# 我们使用adam以0.001的learning rate进行优化\n",
    "optimizer = Adam(lr=1e-3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.compile(loss='binary_crossentropy',\n",
    "              optimizer=optimizer,\n",
    "              metrics=['accuracy'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "embedding_1 (Embedding)      (None, 236, 300)          15000000  \n",
      "_________________________________________________________________\n",
      "bidirectional_1 (Bidirection (None, 236, 64)           85248     \n",
      "_________________________________________________________________\n",
      "lstm_2 (LSTM)                (None, 16)                5184      \n",
      "_________________________________________________________________\n",
      "dense_1 (Dense)              (None, 1)                 17        \n",
      "=================================================================\n",
      "Total params: 15,090,449\n",
      "Trainable params: 90,449\n",
      "Non-trainable params: 15,000,000\n",
      "_________________________________________________________________\n"
     ]
    }
   ],
   "source": [
    "# 我们来看一下模型的结构，一共90k左右可训练的变量\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 建立一个权重的存储点\n",
    "path_checkpoint = 'sentiment_checkpoint.keras'\n",
    "checkpoint = ModelCheckpoint(filepath=path_checkpoint, monitor='val_loss',\n",
    "                                      verbose=1, save_weights_only=True,\n",
    "                                      save_best_only=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 尝试加载已训练模型\n",
    "try:\n",
    "    model.load_weights(path_checkpoint)\n",
    "except Exception as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 定义early stoping如果3个epoch内validation loss没有改善则停止训练\n",
    "earlystopping = EarlyStopping(monitor='val_loss', patience=3, verbose=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 自动降低learning rate\n",
    "lr_reduction = ReduceLROnPlateau(monitor='val_loss',\n",
    "                                       factor=0.1, min_lr=1e-5, patience=0,\n",
    "                                       verbose=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 定义callback函数\n",
    "callbacks = [\n",
    "    earlystopping, \n",
    "    checkpoint,\n",
    "    lr_reduction\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# 开始训练\n",
    "model.fit(X_train, y_train,\n",
    "          validation_split=0.1, \n",
    "          epochs=20,\n",
    "          batch_size=128,\n",
    "          callbacks=callbacks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**结论**  \n",
    "我们首先对测试样本进行预测，得到了还算满意的准确度。  \n",
    "之后我们定义一个预测函数，来预测输入的文本的极性，可见模型对于否定句和一些简单的逻辑结构都可以进行准确的判断。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "400/400 [==============================] - 5s 12ms/step\n",
      "Accuracy:87.50%\n"
     ]
    }
   ],
   "source": [
    "result = model.evaluate(X_test, y_test)\n",
    "print('Accuracy:{0:.2%}'.format(result[1]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict_sentiment(text):\n",
    "    print(text)\n",
    "    # 去标点\n",
    "    text = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——！，。？、~@#￥%……&*（）]+\", \"\",text)\n",
    "    # 分词\n",
    "    cut = jieba.cut(text)\n",
    "    cut_list = [ i for i in cut ]\n",
    "    # tokenize\n",
    "    for i, word in enumerate(cut_list):\n",
    "        try:\n",
    "            cut_list[i] = cn_model.vocab[word].index\n",
    "        except KeyError:\n",
    "            cut_list[i] = 0\n",
    "    # padding\n",
    "    tokens_pad = pad_sequences([cut_list], maxlen=max_tokens,\n",
    "                           padding='pre', truncating='pre')\n",
    "    # 预测\n",
    "    result = model.predict(x=tokens_pad)\n",
    "    coef = result[0][0]\n",
    "    if coef >= 0.5:\n",
    "        print('是一例正面评价','output=%.2f'%coef)\n",
    "    else:\n",
    "        print('是一例负面评价','output=%.2f'%coef)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "酒店设施不是新的，服务态度很不好\n",
      "是一例负面评价 output=0.14\n",
      "酒店卫生条件非常不好\n",
      "是一例负面评价 output=0.09\n",
      "床铺非常舒适\n",
      "是一例正面评价 output=0.76\n",
      "房间很凉，不给开暖气\n",
      "是一例负面评价 output=0.17\n",
      "房间很凉爽，空调冷气很足\n",
      "是一例正面评价 output=0.66\n",
      "酒店环境不好，住宿体验很不好\n",
      "是一例负面评价 output=0.06\n",
      "房间隔音不到位\n",
      "是一例负面评价 output=0.17\n",
      "晚上回来发现没有打扫卫生\n",
      "是一例负面评价 output=0.25\n",
      "因为过节所以要我临时加钱，比团购的价格贵\n",
      "是一例负面评价 output=0.06\n"
     ]
    }
   ],
   "source": [
    "test_list = [\n",
    "    '酒店设施不是新的，服务态度很不好',\n",
    "    '酒店卫生条件非常不好',\n",
    "    '床铺非常舒适',\n",
    "    '房间很凉，不给开暖气',\n",
    "    '房间很凉爽，空调冷气很足',\n",
    "    '酒店环境不好，住宿体验很不好',\n",
    "    '房间隔音不到位' ,\n",
    "    '晚上回来发现没有打扫卫生',\n",
    "    '因为过节所以要我临时加钱，比团购的价格贵'\n",
    "]\n",
    "for text in test_list:\n",
    "    predict_sentiment(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**错误分类的文本**\n",
    "经过查看，发现错误分类的文本的含义大多比较含糊，就算人类也不容易判断极性，如index为101的这个句子，好像没有一点满意的成分，但这例子评价在训练样本中被标记成为了正面评价，而我们的模型做出的负面评价的预测似乎是合理的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred = model.predict(X_test)\n",
    "y_pred = y_pred.T[0]\n",
    "y_pred = [1 if p>= 0.5 else 0 for p in y_pred]\n",
    "y_pred = np.array(y_pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_actual = np.array(y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 找出错误分类的索引\n",
    "misclassified = np.where( y_pred != y_actual )[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "400\n"
     ]
    }
   ],
   "source": [
    "# 输出所有错误分类的索引\n",
    "len(misclassified)\n",
    "print(len(X_test))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                                                                                                                                                        由于2007年 有一些新问题可能还没来得及解决我因为工作需要经常要住那里所以慎重的提出以下 ：1 后 的 淋浴喷头的位置都太高我换了房间还是一样很不好用2 后的一些管理和服务还很不到位尤其是前台入住和 时代效率太低每次 都超过10分钟好像不符合 宾馆的要求\n",
      "预测的分类 0\n",
      "实际的分类 1.0\n"
     ]
    }
   ],
   "source": [
    "# 我们来找出错误分类的样本看看\n",
    "idx=101\n",
    "print(reverse_tokens(X_test[idx]))\n",
    "print('预测的分类', y_pred[idx])\n",
    "print('实际的分类', y_actual[idx])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                                                                                                                                                                                                                 还是很 设施也不错但是 和以前 比急剧下滑了 和客房 的服务极差幸好我不是很在乎\n",
      "预测的分类 0\n",
      "实际的分类 1.0\n"
     ]
    }
   ],
   "source": [
    "idx=1\n",
    "print(reverse_tokens(X_test[idx]))\n",
    "print('预测的分类', y_pred[idx])\n",
    "print('实际的分类', y_actual[idx])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
